09. Missing Values and Outliers

ND320 AIHCND C01 L01 A09 Analyzing Dataset For Missing Values And Imputing Methods

Missing values are especially common in healthcare data, where records may be incomplete or some fields are only sparsely populated.

Missing Data Classification

MCAR stands for Missing Completely at Random. The data is missing for a reason unrelated to the data itself, so there is no systematic pattern: every case has an equal probability of having a missing value. This is often caused by an instrumentation or process issue, such as a broken instrument, that makes some of the data go missing at random.

MAR stands for Missing at Random. Unlike MCAR, there is a systematic relationship between the probability that a value is missing and other, observed data, though not the missing value itself. For example, some demographic groups may be more likely to leave certain survey questions unanswered.

MNAR stands for Missing Not at Random. This usually means the probability that a value is missing is related to the missing value itself.

Understanding why data is missing helps you choose the best imputation method, or decide whether to drop the values from your dataset instead.
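The three mechanisms are easier to tell apart when simulated. The sketch below is only an illustration on synthetic data: the column names (`sex`, `weight_kg`), the probabilities, and the 90 kg cutoff are all assumptions made up for this example, not values from the lesson.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey: sex is always recorded, weight may go missing.
df = pd.DataFrame({
    "sex": rng.choice(["F", "M"], size=n),
    "weight_kg": rng.normal(75, 15, size=n),
})

# MCAR: every row has the same 20% chance of losing its weight value.
mcar_mask = rng.random(n) < 0.20

# MAR: the chance depends on an observed column (sex), not on weight itself.
mar_mask = rng.random(n) < np.where(df["sex"] == "F", 0.30, 0.10)

# MNAR: the chance depends on the value that ends up missing (high weights).
mnar_mask = rng.random(n) < np.where(df["weight_kg"] > 90, 0.50, 0.05)

# Series.mask replaces the flagged rows with NaN.
df["weight_mcar"] = df["weight_kg"].mask(mcar_mask)
df["weight_mar"] = df["weight_kg"].mask(mar_mask)
df["weight_mnar"] = df["weight_kg"].mask(mnar_mask)

# Share of missing values per sex for each mechanism.
simulated = ["weight_mcar", "weight_mar", "weight_mnar"]
missing_by_sex = df[simulated].isnull().groupby(df["sex"]).mean()
print(missing_by_sex)

# Compare the mean of what is still observed with the true mean.
print(df[["weight_kg"] + simulated].mean())
```

In this simulation the MCAR column is missing at about the same rate for both groups, the MAR column is missing more often for one group, and the MNAR column's observed mean is pulled below the true mean because high values are hidden preferentially.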

Code Concepts

Create a function to check the percentage of missing and zero values in each column.

import pandas as pd

def check_for_missing_and_null(df):
    # Report the percentage of null and zero values in each column.
    null_df = pd.DataFrame({'columns': df.columns,
                            'percent_null': df.isnull().sum() * 100 / len(df),
                            'percent_zero': df.isin([0]).sum() * 100 / len(df)
                            })
    return null_df

Apply that function to the original dataframe:
check_for_missing_and_null(dataframe)
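To make the output concrete, here is a minimal run on a toy dataframe. The function is repeated so the snippet runs on its own; the records, column names, and values are hypothetical and chosen only so the percentages are easy to verify by hand.

```python
import numpy as np
import pandas as pd

def check_for_missing_and_null(df):
    # Same logic as the function above, repeated so this snippet is self-contained.
    return pd.DataFrame({
        "columns": df.columns,
        "percent_null": df.isnull().sum() * 100 / len(df),
        "percent_zero": df.isin([0]).sum() * 100 / len(df),
    })

# Hypothetical toy records (not from the course dataset).
toy_df = pd.DataFrame({
    "age": [34, 51, np.nan, 47],
    "num_procedures": [0, 2, 0, 1],
    "weight_kg": [70.5, np.nan, np.nan, 0.0],
})

report = check_for_missing_and_null(toy_df)
print(report)
```

With four rows, one null in `age` shows up as 25 percent null, two zeros in `num_procedures` as 50 percent zero, and `weight_kg` as 50 percent null and 25 percent zero. A zero procedure count is plausible, while a zero weight most likely stands in for a missing measurement.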

View the results and see whether any values stand out. Again, you may need to handle different columns in different ways, depending on their type and the reason for the missing or zero values.
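One possible way to act on those percentages is sketched below. It is an assumption-laden illustration, not the lesson's method: the helper `drop_or_impute`, the 50 percent drop threshold, and the choice of median imputation are all hypothetical, and simple median filling is easiest to justify when the data is plausibly MCAR.

```python
import numpy as np
import pandas as pd

def drop_or_impute(df, zero_as_missing=(), drop_threshold=50.0):
    """Hypothetical helper: drop sparse columns, median-impute the rest."""
    out = df.copy()
    # Treat zeros as missing only in columns where zero is not a plausible value.
    for col in zero_as_missing:
        out[col] = out[col].replace(0, np.nan)
    percent_null = out.isnull().mean() * 100
    # Drop columns whose share of missing values exceeds the threshold.
    sparse_cols = percent_null[percent_null > drop_threshold].index
    out = out.drop(columns=sparse_cols)
    # Fill the remaining numeric gaps with each column's median.
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out

# Hypothetical records (not from the course dataset).
records = pd.DataFrame({
    "age": [34.0, 51.0, np.nan, 47.0],
    "num_procedures": [0, 2, 0, 1],
    "weight_kg": [70.5, 0.0, 82.0, 64.0],
    "rare_lab_value": [np.nan, np.nan, np.nan, 1.2],
})

cleaned = drop_or_impute(records, zero_as_missing=["weight_kg"])
print(cleaned)
```

Here `rare_lab_value` is dropped because 75 percent of it is missing, the zero weight is treated as missing and filled with the median, and the zeros in `num_procedures` are left alone because zero is a valid count.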

Additional Resources

Code

If you need the code, it is available at https://github.com/udacity.

Missing and Zero Values

Which of the following is true about missing and zero values in your data?

SOLUTION: Finding the percentage of missing and zero values can help inform whether to impute or drop values or fields.

Missing Data Classification

QUIZ QUESTION:

Match the correct term to a description.

ANSWER CHOICES:

Description

Women could be less likely to give their weight on a survey.

White cell value data is missing because a testing machine was improperly calibrated.

Those with low education are not accounted for in a study.

Term

MCAR

MAR

MNAR

SOLUTION:

Description: Women could be less likely to give their weight on a survey.
Term: MAR

Description: Those with low education are not accounted for in a study.
Term: MNAR

Description: White cell value data is missing because a testing machine was improperly calibrated.
Term: MCAR